To kill or to be killed: pangenome analysis of Escherichia coli strains reveals a tailocin specific for pandemic ST131

Background: Escherichia coli (E. coli) has been one of the most studied model organisms in the history of life sciences. Initially thought to be merely a commensal bacterium, E. coli has shown wide phenotypic diversity, including pathogenic isolates with great relevance to public health. Though pangenome analysis has been attempted several times, there is no systematic functional characterization of the E. coli subgroups according to their gene profiles.

Results: Systematically scanning for optimal parametrization, we have built the E. coli pangenome from 1324 complete genomes. The pangenome size is estimated to be ~25,000 gene families (GFs). Whereas the core genome diminishes as more genomes are added, the softcore genome (≥95% of strains) is stable at ~3000 GFs regardless of the total number of genomes. Apparently, the softcore genome (with a 92% or 95% generation threshold) can define the genome of a bacterial species, listing the critically relevant, evolutionarily most conserved or important classes of GFs. Unsupervised clustering of common E. coli sequence types using the presence/absence GF matrix reveals distinct characteristics of E. coli phylogroups B1, B2, and E. We highlight the bi-lineage nature of B1, the variation of the secretion and iron acquisition systems in ST11 (E), and the incorporation of a highly conserved prophage into the genome of ST131 (B2). The tail structure of the prophage is evolutionarily related to R2-pyocin (a tailocin) from Pseudomonas aeruginosa PAO1. We hypothesize that this molecular machinery is highly likely to play an important role in protecting its own colonies, thus contributing to the rapid rise of pandemic E. coli ST131.

Conclusions: This study has explored optimized pangenome development in E. coli. We provide complete GF lists and the pangenome matrix as supplementary data for further studies. We identified biological characteristics of different E. coli subtypes, specifically for phylogroups B1, B2, and E. We found an operon-like genome region coding for a tailocin specific to ST131 strains. This tailocin is a potential killer weapon that provides pandemic E. coli ST131 with an advantage in inter-bacterial competition and may explain its dominance as a human pathogen among E. coli strains.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12915-022-01347-7.
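To make the core and softcore definitions in the abstract concrete, the following minimal sketch derives both sets from a presence/absence mapping. The function name, the input layout (GF identifier mapped to the set of genomes carrying it), and the toy data are illustrative assumptions, not the pipeline used in the study.

```python
from typing import Dict, Set, Tuple


def classify_gene_families(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    softcore_threshold: float = 0.95,
) -> Tuple[Set[str], Set[str]]:
    """Return (core, softcore) gene-family sets.

    presence maps a GF identifier to the set of genome identifiers that
    carry it (a hypothetical layout chosen for this sketch).
    core: GFs present in every genome.
    softcore: GFs present in at least softcore_threshold of the genomes.
    """
    core: Set[str] = set()
    softcore: Set[str] = set()
    for gf, genomes in presence.items():
        if len(genomes) == n_genomes:
            core.add(gf)
        if len(genomes) / n_genomes >= softcore_threshold:
            softcore.add(gf)
    return core, softcore
```

By construction the core set is a subset of the softcore set, which is consistent with the softcore remaining stable while the strict core shrinks as genomes are added.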


Supplementary Figure S4: The distribution of virulence categories across the 21 most common sequence types of E. coli
The distribution of virulence categories across the 21 common sequence types of E. coli, ordered by phylogroup. Based on the total number of virulence factors (VFs) present in a genome, we assigned each genome to one of four virulence categories: (1) likely nonpathogenic (#VFs < 6); (2) likely virulence (6 <= #VFs < 14); (3) high virulence (14 <= #VFs < 22); and (4) very high virulence (#VFs >= 22). The phylogroup B1* represents phylogroup B1 with Shiga toxin.

Supplementary Figure S5: Distribution of COG categories in GFs of the reference genome and in the softcore genome
The distribution of COG categories for the gene families in (A) the E. coli reference in the COG database; and (B) the softcore genome.
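The four virulence categories defined for Supplementary Figure S4 amount to a threshold rule on the VF count per genome. A minimal sketch of that rule follows; the function name is hypothetical, and the labels are copied from the caption.

```python
def virulence_category(n_vfs: int) -> str:
    """Map a genome's virulence-factor count to the caption's category."""
    if n_vfs < 0:
        raise ValueError("VF count cannot be negative")
    if n_vfs < 6:
        return "likely nonpathogenic"   # #VFs < 6
    if n_vfs < 14:
        return "likely virulence"       # 6 <= #VFs < 14
    if n_vfs < 22:
        return "high virulence"         # 14 <= #VFs < 22
    return "very high virulence"        # #VFs >= 22
```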

Supplementary Figure S6: Distribution of COG categories in GFs in ST131
The distribution of COG categories for the gene families that are (A) common in E. coli ST131; and (B) rare in E. coli ST131.

Supplementary Figure S7: The presence of s1-ST131 or s2-ST131 in ST131 and other E. coli genomes
The distribution of BLASTN coverage for the (A) s1-ST131 and (B) s2-ST131 clusters in ST131 genomes and in non-ST131 genomes. Presence of the s1-ST131 or s2-ST131 cluster corresponds to a BLASTN coverage of 95% to 100%, whereas absence corresponds to a coverage of 0% to 5%. Coverage between 5% and 95% indicates partial presence of the cluster.

Supplementary Figure S8: Sequences similar to s1-ST131 among non-E. coli genomes
The top 20 hits of NCBI BLASTN against the nr database, excluding E. coli genomes, for the s1-ST131 cluster.

Supplementary Figure S10: Genome browser results of the s2-ST131 cluster
Genome browser results of the s2-ST131 cluster based on GCF_000931565.1 as the representative E. coli ST131 genome. The region shown is on chromosome NZ_CP010876.1 (2,042,977-2,066,794). The highlighted regions represent the nanomachine (tailocin), the capsid, and the lysis-related genes.

Supplementary Figure S11: The presence of s1-ST11 or s2-ST11 in ST11 and other E. coli genomes
The distribution of BLASTN coverage for the (A) s1-ST11 and (B) s2-ST11 clusters in ST11 genomes and in non-ST11 genomes. Presence of the s1-ST11 or s2-ST11 cluster corresponds to a BLASTN coverage of 95% to 100%, whereas absence corresponds to a coverage of 0% to 5%. Coverage between 5% and 95% indicates partial presence of the cluster.
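The coverage bins used in the BLASTN coverage figures for the ST131 and ST11 clusters can be sketched as a small classifier. The captions do not say which bin the exact boundary values 5% and 95% fall into, so this sketch assumes they belong to the "absent" and "present" bins respectively; the function name and labels are likewise illustrative.

```python
def cluster_status(coverage_pct: float) -> str:
    """Bin a BLASTN coverage percentage into present / partial / absent.

    Assumption: exactly 95% counts as present and exactly 5% as absent;
    the captions leave these boundary cases open.
    """
    if not 0.0 <= coverage_pct <= 100.0:
        raise ValueError("coverage must be a percentage in [0, 100]")
    if coverage_pct >= 95.0:
        return "present"
    if coverage_pct <= 5.0:
        return "absent"
    return "partial"
```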

Supplementary Figure S16: Histogram of the -log10 P-value generated with CoinFinder
Histogram of the -log10 P-values of the pairwise GF associations, generated from the CoinFinder output for all pairwise comparisons. The figure on the right is an enlarged section of the distribution with the y-axis truncated at 10^6. A P-value of 1 x 10^-20 is selected as the ad hoc cut-off criterion for significant pairwise comparisons in this study.
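As a small worked illustration of the transformation and cut-off described for this histogram, the sketch below converts a P-value to -log10 scale and applies the 1 x 10^-20 threshold. The helper names are hypothetical, treating the threshold as inclusive is an assumption, and the code does not read or reproduce CoinFinder output.

```python
import math

CUTOFF = 1e-20  # ad hoc significance cut-off stated in the caption


def neg_log10(p_value: float) -> float:
    """Return -log10(p); a P-value of exactly 0 maps to infinity."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("P-value must lie in [0, 1]")
    if p_value == 0.0:
        return math.inf
    return -math.log10(p_value)


def is_significant(p_value: float, cutoff: float = CUTOFF) -> bool:
    """A pairwise association is kept when p <= cutoff (inclusive by assumption)."""
    return p_value <= cutoff
```

On the -log10 scale, the cut-off corresponds to keeping pairs with a value of 20 or more.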

Supplementary Figure S17: Distribution of significantly associated GFs (associated GF cluster sizes)
The distribution of the number of associated GFs for each significant GF (P-value <= 1 x 10^-20). The number of associated GFs for each significant GF ranges from 1 to 607. Although the overwhelming majority of GFs have fewer than 50 associated GFs, a substantial number of GFs have many associated GFs, notably those with more than 300.
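The quantity plotted here, the number of associated GFs per significant GF, can be derived from a list of pairwise associations as sketched below. The input layout (tuples of two GF identifiers and a P-value) and the function name are assumptions made for illustration and do not reflect the actual CoinFinder output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def associated_gf_counts(
    pairs: Iterable[Tuple[str, str, float]],
    cutoff: float = 1e-20,
) -> Dict[str, int]:
    """Count, for each GF, its distinct partners with P-value <= cutoff.

    pairs: (gf_a, gf_b, p_value) tuples, a hypothetical input layout.
    GFs without any significant partner are omitted from the result.
    """
    partners: Dict[str, Set[str]] = defaultdict(set)
    for gf_a, gf_b, p_value in pairs:
        if p_value <= cutoff:
            # Associations are symmetric, so record both directions.
            partners[gf_a].add(gf_b)
            partners[gf_b].add(gf_a)
    return {gf: len(others) for gf, others in partners.items()}
```

A histogram of the returned values would give a distribution of the kind the caption describes.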